39 research outputs found

    Feature Partitioning for the Co-Training Setting

    Supervised learning algorithms rely on the availability of labeled data. Labeled data is either scarce or involves substantial human effort in the labeling process. These two factors, along with the abundance of unlabeled data, have spurred research initiatives that exploit unlabeled data to boost supervised learning. Learning algorithms of this kind, which use unlabeled data alongside a small set of labeled data, are known as semi-supervised learning algorithms. Data characteristics, such as the presence of a generative model, provide the foundation for applying these learning algorithms. Co-training is one such algorithm; it leverages the existence of two redundant views of a data instance. Based on these two views, the co-training algorithm trains two classifiers using the labeled data. The small set of labeled data results in a pair of weak classifiers. With the help of the unlabeled data, the two classifiers alternately boost each other to achieve a high-accuracy classifier. The conditions that the co-training algorithm imposes on the data characteristics restrict its application to data that possesses a natural split of the feature set. In this thesis we study the co-training setting and propose to overcome the above-mentioned constraint by manufacturing feature splits. We pose and investigate the following questions: 1. Can a feature split be constructed for a dataset such that the co-training algorithm can be applied to it? 2. If a feature split can be engineered, would splitting the features into more than two partitions give a better classifier? In essence, does moving from co-training (2 classifiers) to k-training (k classifiers) help? 3. Is there an optimal number of views for a dataset such that k-training leads to an optimal classifier? The task of obtaining feature splits is approached by modeling the problem as a graph partitioning problem. Experiments are conducted on a breadth of text datasets. Results of k-training using constructed feature sets are compared with those of the expectation-maximization algorithm, which has been successful in a semi-supervised setting.
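The co-training loop described in this abstract can be illustrated with a minimal sketch. It assumes the two feature views are already given (the thesis constructs them by graph partitioning, which is not shown here), and it uses a toy nearest-centroid classifier as a stand-in for whatever base learner the thesis actually uses. The names `CentroidClassifier`, `co_train`, `rounds`, and `per_round` are hypothetical and chosen only for illustration.

```python
from collections import Counter


class CentroidClassifier:
    """Toy nearest-centroid classifier over one feature view (illustrative stand-in)."""

    def fit(self, X, y):
        sums, counts = {}, Counter(y)
        for x, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(x))
            for i, value in enumerate(x):
                acc[i] += value
        # One centroid per class: the mean of that class's feature vectors.
        self.centroids = {c: [s / counts[c] for s in acc] for c, acc in sums.items()}
        return self

    def predict_with_confidence(self, x):
        # Rank classes by squared distance; confidence is the margin
        # between the closest centroid and the runner-up.
        ranked = sorted(
            (sum((a - b) ** 2 for a, b in zip(x, c)), label)
            for label, c in self.centroids.items()
        )
        best_dist, best_label = ranked[0]
        margin = ranked[1][0] - best_dist if len(ranked) > 1 else 0.0
        return best_label, margin


def co_train(view_a, view_b, labels, unlabeled_a, unlabeled_b, rounds=5, per_round=1):
    """Assumed variant: each view's classifier labels its most confident
    unlabeled examples, which join the shared labeled set; both classifiers
    are then retrained at the end of every round."""
    La, Lb, y = list(view_a), list(view_b), list(labels)
    pool = list(zip(unlabeled_a, unlabeled_b))
    clf_a = CentroidClassifier().fit(La, y)
    clf_b = CentroidClassifier().fit(Lb, y)
    for _ in range(rounds):
        if not pool:
            break
        for view_index, clf in ((0, clf_a), (1, clf_b)):
            if not pool:
                break
            scored = [(clf.predict_with_confidence(p[view_index]), i)
                      for i, p in enumerate(pool)]
            scored.sort(key=lambda s: s[0][1], reverse=True)
            # Pop highest indices first so earlier indices stay valid.
            for (label, _), i in sorted(scored[:per_round], key=lambda s: s[1], reverse=True):
                xa, xb = pool.pop(i)
                La.append(xa)
                Lb.append(xb)
                y.append(label)
        clf_a = CentroidClassifier().fit(La, y)
        clf_b = CentroidClassifier().fit(Lb, y)
    return clf_a, clf_b


# Two labeled instances and four unlabeled ones, each with two 1-D views.
clf_a, clf_b = co_train(
    view_a=[[0.0], [10.0]], view_b=[[0.0], [10.0]], labels=[0, 1],
    unlabeled_a=[[1.0], [9.0], [2.0], [8.0]],
    unlabeled_b=[[2.0], [8.0], [1.0], [9.0]],
)
```

Moving from co-training to k-training would, under the same assumptions, amount to keeping k views and k classifiers in the inner loop instead of two.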

    Document Clustering with Bursty Information

    Nowadays, almost all text corpora, such as blogs, emails and RSS feeds, are collections of text streams. The traditional vector space model (VSM), or bag-of-words representation, cannot capture the temporal aspect of these text streams. So far, only a few bursty features have been proposed to create text representations with temporal modeling for the text streams. We propose bursty feature representations that perform better than VSM on various text mining tasks, such as document retrieval, topic modeling and text categorization. For text clustering, we propose a novel framework to generate a bursty distance measure. We evaluated it on the UPGMA, Star and K-Medoids clustering algorithms. The bursty distance measure not only performed equally well across various text collections but also clustered news articles related to specific events much better than other models.

    Proteomic and Phospho-Proteomic Profile of Human Platelets in Basal, Resting State: Insights into Integrin Signaling

    During atherogenesis and vascular inflammation, quiescent platelets are activated to increase the surface expression and ligand affinity of the integrin αIIbβ3 via inside-out signaling. Diverse signals such as thrombin, ADP and epinephrine transduce signals through their respective GPCRs to activate protein kinases that ultimately lead to the phosphorylation of the cytoplasmic tail of the integrin αIIbβ3 and augment its function. The signaling pathways that transmit signals from the GPCR to the cytosolic domain of the integrin are not well defined. In an effort to better understand these pathways, we employed a combination of proteomic profiling and computational analyses of isolated human platelets. We analyzed ten independent human samples and identified a total of 1507 unique proteins in platelets. This is the most comprehensive platelet proteome assembled to date and includes 190 membrane-associated and 262 phosphorylated proteins, which were identified via independent proteomic and phospho-proteomic profiling. We used this proteomic dataset to create a platelet protein-protein interaction (PPI) network and applied novel contextual information about the phosphorylation step to introduce limited directionality in the PPI graph. This newly developed contextual PPI network computationally recapitulated an integrin signaling pathway. Most importantly, our approach not only provided insights into the mechanism of integrin αIIbβ3 activation in resting platelets but also provides an improved model for analysis and discovery of PPI dynamics and signaling pathways in the future.

    Diverse Rule Sets

    While machine-learning models are flourishing and transforming many aspects of everyday life, the inability of humans to understand complex models makes it difficult for these models to be fully trusted and embraced. Thus, interpretability of models has been recognized as a quality as important as their predictive power. In particular, rule-based systems are experiencing a renaissance owing to their intuitive if-then representation. However, simply being rule-based does not ensure interpretability. For example, overlapping rules create ambiguity and hinder interpretation. Here we propose a novel approach to inferring diverse rule sets, by optimizing for small overlap among decision rules with a 2-approximation guarantee under the framework of Max-Sum diversification. We formulate the problem as maximizing a weighted sum of the discriminative quality and the diversity of a rule set. To overcome the exponential-size search space of association rules, we investigate several natural options for a small candidate set of high-quality rules, including frequent and accurate rules, and examine their hardness. Leveraging the special structure in our formulation, we then devise an efficient randomized algorithm, which samples rules that are highly discriminative and have small overlap. The proposed sampling algorithm analytically targets a distribution of rules that is tailored to our objective. We demonstrate the superior predictive power and interpretability of our model with a comprehensive empirical study against strong baselines.
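To make the "weighted sum of discriminative quality and diversity" concrete, here is a small sketch of how such a Max-Sum style objective could be evaluated for a candidate rule set. The abstract does not specify the quality or diversity measures, so everything here is an assumption for illustration: diversity is taken as the Jaccard distance between the sets of records two rules cover, and `quality`, `coverage`, and the weight `lam` are hypothetical inputs. It only scores a given set; it is not the paper's sampling algorithm.

```python
from itertools import combinations


def jaccard_distance(cover_a, cover_b):
    """1 minus the Jaccard overlap of two coverage sets (assumed diversity measure)."""
    union = cover_a | cover_b
    if not union:
        return 0.0
    return 1.0 - len(cover_a & cover_b) / len(union)


def rule_set_objective(rules, quality, coverage, lam=1.0):
    """Sum of per-rule quality plus lam times the sum of pairwise distances."""
    quality_term = sum(quality[r] for r in rules)
    diversity_term = sum(
        jaccard_distance(coverage[a], coverage[b]) for a, b in combinations(rules, 2)
    )
    return quality_term + lam * diversity_term


# Hypothetical rules: r1 and r2 overlap on record 2, r3 is disjoint from both.
coverage = {"r1": {1, 2}, "r2": {2, 3}, "r3": {4}}
quality = {"r1": 0.5, "r2": 0.5, "r3": 0.25}

overlapping = rule_set_objective(["r1", "r2"], quality, coverage)
disjoint = rule_set_objective(["r1", "r3"], quality, coverage)
```

With these made-up numbers the disjoint pair scores higher than the overlapping pair even though its rules are individually weaker, which is the trade-off the weighted objective is meant to express.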
